-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
statusccl: temporarily serve /_status/nodes to tenants #93268
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only have an idea to consider just inlining the code.
Reviewed all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @matthewtodd)
pkg/server/fanout_clients.go
line 106 at r1 (raw file):
}) } return &resp, err
Don't feel too strongly but since there's no matching impl for the kvFanoutClient
, I'd just inline this in the Nodes
handler.
pkg/server/status.go
line 1516 at r1 (raw file):
} return s.serverIterator.nodes(ctx)
FYI: you probably checked already but this could create some bad failure state in DB Console. It's used to getting an unimplemented 5xx error on this endpoint for tenants, and this will lead it to get a 200 which could trigger additional runtime exceptions elsewhere due to missing data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian)
pkg/server/status.go
line 1516 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
FYI: you probably checked already but this could create some bad failure state in DB Console. It's used to getting an unimplemented 5xx error on this endpoint for tenants, and this will lead it to get a 200 which could trigger additional runtime exceptions elsewhere due to missing data.
Thanks for calling this out. Looking through the selectors that depend on this response, I think we're good, but I'll double-check with fresh eyes in the morning before merging.
pkg/server/fanout_clients.go
line 106 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Don't feel too strongly but since there's no matching impl for the
kvFanoutClient
, I'd just inline this in theNodes
handler.
Good idea, done. (Should keep your getAllNodes
/ nodesList
work easier to reason about, too!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian)
pkg/server/status.go
line 1516 at r1 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
Thanks for calling this out. Looking through the selectors that depend on this response, I think we're good, but I'll double-check with fresh eyes in the morning before merging.
Found one, tested and fixed. Will merge on green. Thanks again, David!
Fixes #92065 Part of #89949 Our current UI architecture has many of the SQL Activity pages mapping node or sql instance ids to regions at page load time, using the response from this `/_status/nodes` endpoint to get the information it needs. This approach is problematic for ephemeral serverless tenants but will need a broader reimagining, probably ending up with us storing locality information directly in many of the sqlstats tables, which will probably require something like sending instance localities along in the distsql tracing. An initial exploration in this direction happened in #92259. In the meantime, as a stop-gap measure, we implement a reduced version of this `/_status/nodes` endpoint for tenants, keeping the SQL Activity pages somewhat working for multiregion tenants. Release note: None
bors r+ |
Build succeeded: |
89613: gossip: remove frequent gossiping of gossip client connections r=irfansharif a=a-robinson These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them. These gossip-clients keys make up two thirds or more of the gossip traffic in various large clusters I've inspected, consuming almost an entire CPU core in the worst case I've seen. They don't provide enough value to justify that sort of ongoing cost, so this commit removes them entirely as well as the periodic logging of the gossip network and the crdb_internal.gossip_network table, both of which relied on them. Release note (backward-incompatible change): We've stopped supporting/populating the crdb_internal.gossip_network table. It was an internal table with no API guarantees (so perhaps no meriting a release note?). Release note (performance improvement): Significantly reduced CPU usage of the underlying gossip network in large clusters. --- Informs #51838 (largely fixes it for practical purposes, although there's likely still more that could be done) This is clearly going to break [the gossip roachtest](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/gossip.go#L50-L58), but between `@irfansharif` kindly volunteering to fix that up separately and his existing TODO in that file I've left that out of this change. I don't know if completely removing the gossip_network table is really the best idea or if it should just be left in and only populated with the clients from the local node. For example, when run in a mixed version cluster does `debug zip` run all of its sql commands against the local node or does it run some against remote nodes? If an old node ever tries to query the `gossip_network` table on a different node it could have a bad time. `@irfansharif` `@kvoli` 93985: ui: degrade gracefully when regions aren't known r=matthewtodd a=matthewtodd Part of #89949 Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region. This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience. This broader problem of losing the region mapping has been described in #93268; we'll begin addressing it shortly. Release note: None 93989: leaktest: exclude long running logging goroutines r=srosenberg a=nicktrav The `leaktest` package detects potential goroutine leaks by snapshotting the set of goroutines running when `leaktest.AfterTest(t)` is called, returning a closure, and comparing the set of goroutines when the closure is called (typically `defer`'d). A race condition was uncovered in #93849 whereby logging-related goroutines that are scheduled by an `init` function in `pkg/util/logging` can sometimes be spawned _after_ the `AfterTest` function is run. When the test completes and the closure is run, the test fails due to a difference in the before / after goroutine snapshots. This mode of failure is deemed to be a false-positive. The intention of the logging goroutines are that they live for the duration of the process. However, exactly _when_ the goroutines scheduled in the `init` functions actually start run, and hence show up in the goroutine snapshots, is non-deterministic. Exclude the logging goroutines from the `leaktest` checks to reduce the flakiness of tests. Closes #93849. Release note: None. Epic: CRDB-20293 Co-authored-by: Alex Robinson <[email protected]> Co-authored-by: Matthew Todd <[email protected]> Co-authored-by: Nick Travers <[email protected]>
Part of cockroachdb#89949 Previously, when a tenant SQL instance had spun down (leaving us no way to remember which region it had been in), the SQL Activity pages would claim that statements and transactions had occurred in an "undefined" region. This change moves from saying "undefined" to saying nothing at all, a slightly nicer user experience. This broader problem of losing the region mapping has been described in cockroachdb#93268; we'll begin addressing it shortly. Release note: None
95449: sqlstats: include regions in statement_statistics r=matthewtodd a=matthewtodd Part of #89949. This completes the work abandoned in #92259, properly storing the regions in which a statement was executed, rather than waiting until UI render time to try to map (dangerously ephemeral) node IDs to their respective regions. With this work in place, we will be able to remove both the UI mapping code (usages of this [selector][]) and the stop-gap work of #93268. [selector]: https://github.com/cockroachdb/cockroach/blob/0f6333cbd6c78264a59dc435324c0c33d75ea148/pkg/ui/workspaces/cluster-ui/src/store/nodes/nodes.selectors.ts#L52-L64 Release note (sql change): A regions field was added to the statistics column of crdb_internal.statement_statistics, reporting the regions of the nodes on which the statement was executed. Co-authored-by: Matthew Todd <[email protected]>
96991: roachtest: update mixed-version backup to use new framework r=srosenberg a=renatolabs This updates the `backup/mixed-version` roachtest to use the recently introduced mixed-version roachtest framework (`mixedversion` package). The main behavior exercised remains the same: backups are taken in mixed-binary state, and those backups are restored and verified at the end of the test. However, this commit also improves the coverage of mixed-version backup testing in a few ways: * **Randomization**. By virtue of using the new framework, most runs will be different from one another since the order of actions taken by the test will be different. Previously, backups would always be taken with 2 nodes in the old version and 2 nodes in the new version. Now, backups can be taken when an arbitrary number of nodes is running the new version. As a consequence, it's also possible that some executions will attempt backups when all nodes are running a new binary version, but the cluster version itself has not been updated. Other points of new randomization include the choice of the node's external dir where backups are stored, which node to connect to when running certain statements, and how much to wait between backups. * **Backup Options**. Backups will randomly be created with `revision_history` enabled, or with an `encryption_passphrase`. * **Downgrades**. The cluster is also downgraded in mixed-version tests. No downgrades happened in that test before this commit. * **Workload**. Instead of using fixed call to `generate_series` to generate data between backups, the test now runs the `bank` workload continuously during the test. A random wait between backups allows the workload to make changes to the underlying table during the test and for the backups to be taken while writes are taking place. * **Finalization**: the test _may_ attempt to create a backup as the upgrade is finalizing (i.e., migrations are running and cluster version is advancing). In addition, this test will also see improved coverage as we make more improvements to test plans generated by the `mixedversion` package. These changes will create more backup scenarios in the future without requiring any code changes to this test. This test has already helped us uncover one backup bug (#97953). Epic: CRDB-19321 Release note: None 98398: statusccl: stop serving /_status/nodes to tenants r=matthewtodd a=matthewtodd Fixes #98057. This reverts the work of #93268, which is no longer necessary now that we are eagerly capturing region information at execution time in #95449. Release note: None Co-authored-by: Renato Costa <[email protected]> Co-authored-by: Matthew Todd <[email protected]>
Fixes #92065
Part of #89949
Our current UI architecture has many of the SQL Activity pages mapping node or sql instance ids to regions at page load time, using the response from this
/_status/nodes
endpoint to get the information it needs.This approach is problematic for ephemeral serverless tenants but will need a broader reimagining, probably ending up with us storing locality information directly in many of the sqlstats tables, which will probably require something like sending instance localities along in the distsql tracing. An initial exploration in this direction happened in #92259.
In the meantime, as a stop-gap measure, we implement a reduced version of this
/_status/nodes
endpoint for tenants, keeping the SQL Activity pages somewhat working for multiregion tenants.Release note: None